ABSTRACT

An interactive algorithm for soft segmentation and matting of natural
images and videos is presented in this paper. The technique follows
and extends [11], where the user first roughly scribbles/labels
different regions of interest, and from them the whole data is
automatically segmented. The segmentation and alpha matte are
obtained from the fast, linear complexity, computation of weighted
distances to the user-provided scribbles. These weighted distances
assign probabilities to each labelled class for every pixel. The weights
are derived from models of the image regions obtained from the
user-provided scribbles via kernel density estimation. The matting results
follow from combining this density and the computed weighted
distances. We present the underlying framework and examples showing
the capability of the algorithm to segment and compute alpha mattes,
in interactive real time, for difficult natural data.

1.  INTRODUCTION 

Interactive image and video segmentation, and matting, where the
user starts the automatic algorithm by providing rough scribbles
labelling the regions of interest, has received a lot of attention in
recent years; see for example [1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 14, 15] and
references therein, and [11] for a discussion of these works and the
key attributes of distance-based techniques such as the one pursued in this
paper.

In order to address the challenges of real-time interactive image
segmentation, the authors of [11] proposed to exploit the colorization
work in [18], where the goal is to add color (or other special effects)
to a given monochromatic image following color hints provided by
the user via scribbles (see also [8]). The added color depends on
the geodesic distance between the scribble and the pixel being
processed. More specifically, let $s$ and $t$ be two pixels of the image
$\Omega$ and $C_{s,t}$ a path over the image connecting them. The geodesic
distance between $s$ and $t$ is defined by:

$$d(s,t) := \min_{C_{s,t}} \int_{C_{s,t}} W \, dp, \qquad (1)$$

where $p$ stands for the Euclidean arc-length and $W$ is a weight that
depends on the application (see below). The distance (1) can be
efficiently computed in linear time [17], making the algorithm
virtually real time for interactive applications. Let now $\Omega_c$ be the set
of pixels labelled by the user-provided scribbles $l_i$, $i \in [1, N_l]$, with
color indications in [18] or segment labels in [11]. Then, the
distance from a pixel $t$ to a single label $l_i$, $i \in [1, N_l]$, is
$d_i(t) = \min_{s \in \Omega_c : label(s) = l_i} d(s,t)$, and the probability
$\Pr(t \in l_i)$ for $t$ to be assigned to the label $l_i$ representing the
class (color or segment) $i$ is given by

$$\Pr(t \in l_i) = \frac{d_i(t)^{-1}}{\sum_{j \in [1, N_l]} d_j(t)^{-1}}.$$
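The conversion from per-label distances to assignment probabilities can be illustrated with a minimal sketch. It assumes the probability of a label is its normalized inverse distance; the function name `label_probabilities` and the handling of zero distances (a pixel lying on a scribble) are assumptions made for illustration, not details taken from the paper.

```python
def label_probabilities(distances, eps=1e-12):
    """Turn the distances d_i(t) of one pixel into label probabilities.

    distances: list of non-negative floats, one per label.
    Assumption: probability is the normalized inverse distance.
    """
    # Assumption: a pixel on a scribble (distance ~0) belongs to that label.
    zeros = [i for i, d in enumerate(distances) if d <= eps]
    if zeros:
        share = 1.0 / len(zeros)
        return [share if i in zeros else 0.0 for i in range(len(distances))]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]
```

For a pixel at distance 1 from the first label and 3 from the second, the sketch assigns three quarters of the probability to the first label.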

In [18], $W = |\nabla Y \cdot \dot{C}_{s,t}(p)|$, where $Y$ is the given luminance
(and the gradient is 2D for images and 3D for video), and the
probability $\Pr(t \in l_i)$ weights the amount of color the pixel $t$ receives
from the color in the scribble (label) $l_i$.
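As a rough illustration of the weighted distance (1) with this luminance-based weight, the sketch below propagates distances from scribble pixels over a 4-connected grid. It is only an approximation under stated assumptions: the step cost between neighbors is the absolute luminance difference (a simple discretization of the directional derivative), and Dijkstra's algorithm with a binary heap is swapped in for the linear-time method of [17]. The function name `geodesic_distance` is hypothetical.

```python
import heapq

def geodesic_distance(lum, seeds):
    """Weighted distance from every pixel to the nearest seed pixel.

    lum:   2D list of luminance values.
    seeds: iterable of (row, col) scribble pixels.
    Assumption: the cost of a step is |Y(q) - Y(p)| between 4-neighbors.
    """
    rows, cols = len(lum), len(lum[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    heap = []
    for r, c in seeds:
        dist[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + abs(lum[nr][nc] - lum[r][c])
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist
```

Flat regions cost nothing to cross, so a scribble spreads freely inside a uniform area and pays only when it crosses a luminance edge.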

For segmentation, in order to compute $W$, [11] starts by
modelling, via a Gaussian PDF, each region of interest from the
collection of pixels labelled by the user via the scribbles $l_i$ (one Gaussian
PDF per label $l_i$). From this PDF, the likelihood of a pixel to
belong to the same class as label $l_i$ is derived considering competing
PDFs (competing labels). When multiple colors and channels are
used, these likelihoods are further weighted according to the
capability of each channel to discriminate between the provided labels
(these weights are automatically computed). These likelihoods form
the basis for the computation of $W$ in (1); see below and [11] for
complete details.

In this paper we extend the work in [11] at a number of
important levels. First, with enhanced models for the scribbled/labelled
pixels provided by the user, we significantly reduce the user effort
and further improve the computational time. Improvements are also
obtained following a two-stage application of the above described
segmentation approach (with the enhanced models). Second, we
compute explicit alpha matting (foreground opacity), based on the
geodesic distance combined with a function of the actual pixel value.
This is critical for composition applications. Third, we extend the
work to video. The rest of this paper presents these enhancements
and numerous examples.

2.  SEGMENTATION  AND  MATTING  FRAMEWORK 

We  now  present  the  basic  extensions  mentioned  above. 

2.1.  Improved  Foreground  and  Background  Models 

Following the distance-based work [11], we first propose a more
general model for the labelled pixels provided by the user. In [11], the
user specifies scribbles on each "uniform" region, in which the pixel
features (intensities, colors, or filtered responses) are assumed to be
samples from a single Gaussian. Then, as briefly mentioned above,
the algorithm computes the likelihoods and weighted distances for
every pair of competing foreground/background scribbles. This puts
on the user the burden of scribbling many regions, virtually one per
uniform region in the image/video, a process that becomes very
tedious for complicated images. Ideally, we would like the user to
just provide a single scribble for the foreground and a single one
for the background, or in general, a single scribble per region the
user wants to label together. Aiming at this goal, we enhance the
Gaussian model via the standard non-parametric kernel density
estimation [13]. The user places single scribbles roughly across the
foreground (F) and background (B) and lets them automatically
propagate throughout the image via the fast weighted distance
computations. In contrast to [11], where the weights $W$ in (1) are a linear
combination of likelihoods from a set of channels, we use the
gradient of these likelihoods (in agreement with [18]), which shows better
responses to strong edges.

Specifically, let $\Omega_F$ be the set of pixels with label F and $\Omega_B$
those corresponding to the background. We first estimate the PDF
$\Pr(x|F)$ of $\Omega_F$, in Luv color space, via kernel density estimation
(see footnote 1), where $x$ is a color vector. The likelihood $P_F(x)$ of a
given pixel $x$ to belong to F according to this PDF computation is then
expressed as

$$P_F(x) = \frac{\Pr(x|F)}{\Pr(x|F) + \Pr(x|B)}, \qquad (2)$$


and $P_B(x) = 1 - P_F(x)$. We employ the well-developed Fast
Gaussian Transform algorithm to efficiently calculate this
probability, e.g., [16]. The weighted distance (geodesic) from each of the
two labels for every pixel $x$ is then computed as

$$d_l(x) = \min_{s \in \Omega_l} d(s, x), \quad l \in \{F, B\}, \qquad (3)$$

where $d(s,x)$ is the distance defined in (1) with weights $W$
computed as in [11, 18], from (the gradient of) the modified likelihood
described above. From this weighted distance, the probability of
assignment can be obtained as explained in the Introduction.
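The likelihood in (2) can be sketched directly from the scribble samples. The version below is a naive kernel density estimate, assuming an isotropic Gaussian kernel with a single hand-picked bandwidth; it does not use the Fast Gaussian Transform acceleration mentioned above, and the names `kde`, `foreground_likelihood`, and `bandwidth` are illustrative assumptions.

```python
import math

def kde(x, samples, bandwidth):
    """Naive Gaussian kernel density estimate at color vector x.

    The normalizing constant of the kernel is dropped: it is identical
    for F and B (same bandwidth, same dimension) and cancels in the ratio.
    """
    total = 0.0
    for s in samples:
        sq = sum((a - b) ** 2 for a, b in zip(x, s))
        total += math.exp(-sq / (2.0 * bandwidth ** 2))
    return total / len(samples)

def foreground_likelihood(x, fg_samples, bg_samples, bandwidth=5.0):
    """P_F(x) = Pr(x|F) / (Pr(x|F) + Pr(x|B)); P_B(x) is 1 - P_F(x)."""
    pf = kde(x, fg_samples, bandwidth)
    pb = kde(x, bg_samples, bandwidth)
    if pf + pb == 0.0:
        return 0.5  # assumption: undecided when both densities underflow
    return pf / (pf + pb)
```

A color equidistant from the foreground and background samples receives a likelihood of one half, while a color matching only the foreground samples approaches one.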




Fig. 1. (a) The user provides foreground (blue) and background
(green) scribbles. The binary segmentation boundary is shown as
a white line. The segmentation is obtained by selecting for every
pixel the corresponding label with minimal distance. (b) The
resulting alpha mask. (c) Foreground segments composited on a blue screen.


Fig. 2. Figures (a)-(d) show the user inputs and results from [11].
Figures (e)-(h) correspond to the new inputs and results for the same
images, leading to similar results with fewer user-marked scribbles.


2.2.  Alpha  Matting  Computation 

As detailed before, this distance can be used for color blending [18]
or soft segmentation [11]. We now extend this work to obtain an
explicit estimate for the alpha value so that our framework can
intrinsically handle image matting problems. The alpha channel is explicitly
computed as

$$\omega_l(x) = d_l(x)^{-r} \cdot P_l(x), \quad l \in \{F, B\}, \qquad (4)$$

$$\alpha(x) = \frac{\omega_F(x)}{\omega_F(x) + \omega_B(x)}, \qquad (5)$$


where $r$ is a constant that trades off the distances against the
probabilities. In our experiments, $r$ is typically between 0 and 2. Intuitively,
pixels that are close to a scribble (in the weighted distance sense),
and have similar colors (in the likelihood sense following the kernel
density estimation), are assigned higher probabilities for the region
represented by that scribble. This alpha matting combines both the
weighted distance and the probability previously estimated. It
is computed much more efficiently (in interactive real time) than the
works mentioned in the Introduction, with results that are at least
competitive.
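Equations (4) and (5) reduce to a few arithmetic operations per pixel, as the following minimal sketch shows. The function name `alpha_value` and the small clamp `eps`, which avoids dividing by zero for pixels lying exactly on a scribble, are assumptions added for illustration.

```python
def alpha_value(d_f, d_b, p_f, r=1.0, eps=1e-12):
    """Alpha of one pixel from its distances and foreground likelihood.

    d_f, d_b: weighted distances to the F and B scribbles.
    p_f:      foreground likelihood P_F(x); P_B(x) is taken as 1 - p_f.
    r:        exponent trading distance against likelihood.
    """
    p_b = 1.0 - p_f
    # Assumption: clamp distances so that scribble pixels (d = 0) stay finite.
    w_f = max(d_f, eps) ** (-r) * p_f
    w_b = max(d_b, eps) ** (-r) * p_b
    total = w_f + w_b
    if total == 0.0:
        return 0.5  # assumption: undecided when both weights vanish
    return w_f / total
```

With r = 0 the distances drop out and alpha equals the color likelihood alone; larger values of r let the geodesic distances dominate.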

As  a  consequence  of  the  improved  image  modelling  via  kernel 
density  estimation  and  the  explicit  alpha  matting  computation,  we 
significantly  reduce  the  user  input  and  are  capable  of  dealing  with 
difficult  images  such  as  the  one  in  Figure  1.  A  comparison  with  [11] 
is  presented  in  Figure  2. 


1 Kernel density estimation offers better performance and lower
computational times than models such as mixtures of Gaussians.


2.3.  Interactive  Refinement 

The proposed algorithm (like the ones in [11, 18]) allows the users
to interactively add new scribbles to achieve the desired
segmentation in a progressive fashion. By learning from the samples on the
new scribbles being added, the weighted distances get updated and
the new segmentation result is shown to the user. This process is
repeated until the desired segmentation is obtained. Figures 3 and
4 (image from the authors of [12]) illustrate the process of adding
one new foreground scribble $F_2$ to the image. The distance of every
pixel to the foreground labels, as defined previously, is updated to
the smaller value, i.e., $d_F = \min\{d_{F_1}, d_{F_2}\}$. The propagation of $F_2$
stops once $d_{F_2}$ exceeds $d_{F_1}$ or $d_B$. When computing the distance to
the new scribble $F_2$, we use the weights, or probabilities/likelihoods,
between only $F_2$ (not $F_1$) and the previous background scribbles,
giving more accurate local color estimation.
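The bounded propagation of a new foreground scribble can be sketched as a front that only advances while it improves on the existing distances. The code below is an illustration under assumptions: `cost[r][c]` is a hypothetical non-negative step cost for entering a pixel (standing in for the weights derived from the new scribble versus the background), the grid is 4-connected, and Dijkstra's algorithm is used in place of the linear-time scheme of [17].

```python
import heapq

def refine_with_new_scribble(cost, seeds, d_f, d_b):
    """Update foreground distances after adding a new scribble.

    cost:  2D list of non-negative step costs (hypothetical weights).
    seeds: (row, col) pixels of the new foreground scribble.
    d_f:   current foreground distances; d_b: background distances.
    Returns the updated foreground distances (pointwise minimum).
    """
    rows, cols = len(cost), len(cost[0])
    new_f = [row[:] for row in d_f]
    heap = []
    for r, c in seeds:
        new_f[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > new_f[r][c]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                # The front stops where the new scribble no longer beats
                # the old foreground distance or the background distance.
                if nd < new_f[nr][nc] and nd < d_b[nr][nc]:
                    new_f[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return new_f
```

Because the front halts early, only the neighborhood of the new scribble is revisited, which keeps each refinement step local.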

2.4.  Additional  Speedup  Strategies 

Additional  computational  improvements  can  be  obtained  motivated 
by  the  assumption  that  an  object  can  have  semi-opacities  only  around 
its  border  and  alpha  should  be  solid  elsewhere.  This  holds  true  for  a 
large  variety  of  natural  images  and  videos.  The  main  idea  then  is  to 
quickly  find  an  approximate  boundary  and  generate  a  narrow  stripe 
around  it  (trimap).  Then  the  refined  computation  is  limited  within 
this  stripe.2 

2 If the band is large enough that it ends with zero width around the
user-provided scribbles (basically covering the whole image), we are back to
the previously described framework, thereby not obtaining any computational
speedup while remaining fully general.


Fig. 3. The effect of adding a new F scribble. The dotted line shows the
equal-distance line. The $F_2$ scribble only propagates in a limited
area.


Fig. 4. An example showing how the user adds a new scribble to fix
the misclassified hair region.


In the first stage, we decompose the Luv color space into three
channels, each of which is quantized into 256 levels. A pixel's
likelihood is quickly obtained by multiplying, via a look-up table, the
three independently estimated probabilities (from the user-provided
scribbles as detailed above). A binary segmentation follows, Figure
5(a), which is less accurate than the full model described
before, but often good enough as a rough initial segmentation. In
the second stage, a narrow band is spanned by a distance transform
and its borders serve as new foreground and background scribbles
(Figure 5(b)), parameterized by $t \in [0, 1]$ (periodic with period 1
if the contour is closed). The band width depends on the data and
can be interactively adjusted by the user. The likelihoods for pixels
inside the band are then locally recomputed using the feature vector
$(L, u, v, t)$, giving more accurate local estimation. Then we proceed
as before to segment with the weighted distance approach. This
two-step framework further reduces the user intervention and
computational time while making the algorithm more robust and accurate; see
Figure 6 for examples.
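The first-stage look-up of Section 2.4 can be sketched as three per-channel tables whose entries are multiplied for each pixel. This is a simplified illustration: the per-channel probabilities are estimated here with Laplace-smoothed histograms rather than kernel density estimation, pixel channels are assumed to be already quantized to integers in [0, 255], and the names `channel_tables` and `fast_likelihood` are hypothetical.

```python
def channel_tables(pixels, levels=256, smooth=1.0):
    """Build one normalized look-up table per channel from scribble pixels.

    pixels: list of (c0, c1, c2) integer triples in [0, levels - 1].
    smooth: Laplace smoothing count (an assumption, avoids zero entries).
    """
    tables = [[smooth] * levels for _ in range(3)]
    for px in pixels:
        for ch in range(3):
            tables[ch][px[ch]] += 1.0
    for ch in range(3):
        total = sum(tables[ch])
        tables[ch] = [v / total for v in tables[ch]]
    return tables

def fast_likelihood(px, fg_tables, bg_tables):
    """Approximate P_F(x) by treating the three channels as independent."""
    pf = pb = 1.0
    for ch in range(3):
        pf *= fg_tables[ch][px[ch]]
        pb *= bg_tables[ch][px[ch]]
    return pf / (pf + pb)
```

Once the tables are built, each pixel costs six table reads and a handful of multiplications, which is what makes the rough first-stage segmentation cheap.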

Our algorithm preserves the important linear complexity.
Without any code optimization, it runs for 0A4sec and 3.36sec
(excluding the user operating time) for images with sizes 480 x 452 and
1500 x 1500 respectively, on a 1.7 GHz CPU with 512 MB RAM.


3.  EXTENSION  TO  VIDEOS 

Our framework can be easily extended to videos, which can be
modelled as 3D images (no explicit optical flow computation, see also
[18]). Instead of cutting out a region, the algorithm segments a
spatio-temporal tube. The user scribbles on one or more frames and
the scribbles then propagate throughout the whole video (weighted distances
in 3D). See results in Figures 7 and 8.


Fig. 5. (a) A hard segmentation is quickly found by a few scribbles.
The white line indicates the binary segmentation boundary. (b)
Automatically generated trimap and new scribbles parameterized by $t$.
(c) Segmented result.


Fig. 6. Five additional examples. For each set, the user places a few
scribbles to obtain different objects of interest (left), computed
segmentation and matting (middle), and composition into a new
background (right).


4.  CONCLUSIONS  AND  FUTURE  WORK 

In this paper we presented a distance-based algorithm for
(interactive) real-time natural image and video segmentation and matting.
Following the work of [11], we introduced a number of
improvements which greatly simplify user input, reduce computational
complexity, and produce pleasant matting results. Various difficult
examples were presented supporting this. We are currently working on
further improving the video results to handle more difficult scenarios
with occlusions and dynamic background.


Fig.  7.  Two  examples  of  video  segmentation  (pair  of  left  and  pair 
of  right  columns;  first  six  rows).  The  user  draws  scribbles  on  one 
frame  of  the  video  and  the  algorithm  automatically  segments  the 
whole  video  (a  total  of  50  and  72  frames).  The  last  row  shows  an 
example  of  video  composite. 


Fig.  8.  Third  video  example  (a  total  of  66  frames). 